Final Report
Data Science 2 with R (STAT 301-2)
Introduction
This project aims to create a predictive model that can estimate the yearly salaries of professional basketball players from the National Basketball Association (NBA). This model uses season statistics from players as its predictors. Other NBA-related factors, such as the location of the player’s team and how long the player has been in the NBA, are also used for the analysis.
This research question is a regression problem: we are trying to predict salary, a continuous outcome variable. Since we are comparing salaries across several decades, we need to place all measurements on a consistent price basis. Therefore, the target variable measures a player's yearly salary adjusted to 2023 prices using the Consumer Price Index for All Urban Consumers from the US Bureau of Labor Statistics.
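As a sketch of this adjustment (the data frame `nba`, its `salary` and `season_year` columns, and the CPI lookup table are illustrative names, not taken from the project's actual code), each nominal salary is scaled by the ratio of the 2023 CPI-U to the CPI-U of the salary's season:

```r
library(dplyr)

# Hypothetical CPI-U annual averages (the 2023 BLS value is ~304.7).
cpi <- tibble(
  season_year = c(2000, 2023),
  cpi_u       = c(172.2, 304.7)
)
cpi_2023 <- cpi$cpi_u[cpi$season_year == 2023]

# One toy salary row; the real data has one row per player-season.
nba <- tibble(player = "Example Player", season_year = 2000, salary = 1e6)

nba |>
  left_join(cpi, by = "season_year") |>
  mutate(salary_adj = salary * cpi_2023 / cpi_u)
```

Under these index values, a $1,000,000 salary from 2000 becomes roughly $1.77 million in 2023 dollars.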
I see this research most benefiting NBA players. With this model, players would have a better understanding of what they should expect from contract offers based on their previous performance. The model also helps indicate what factors outside of the players’ control, such as what conference a team plays in, contribute towards their salaries.
Besides its player-specific benefits, this model allows me to explore the NBA computationally. I am a great fan of the NBA (especially my hometown Chicago Bulls), so creating this model has been fun and informative of historical season statistics. Observing what statistics are highly valued when a contract is devised also gives me a deeper understanding of how organizations form championship-contending teams.
Data Overview
Before proceeding to methodologies, it is important to check the quality of our data. Doing so will clearly lay out concerns we have about the raw data, driving how recipes, model comparison, and tuning parameters are selected and modified. Below, I use the entire dataset to explore missing values and our target variable.
Missingness
Table 1 presents the number and percent of missingness for each variable. We can see that all missing observations come from the _percent variables. These are variables that measure the shooting percentages of players.
We can interpret this missingness as players who never took a 2-point shot, a 3-point shot, or a free throw during a season. The 3-point shot missingness does not bother me since historically players have contributed greatly to a team while never taking a 3-point shot in a season. For example, during his MVP season in 1999-2000, center Shaquille O'Neal did not attempt a single 3-point shot in the regular season or during the Los Angeles Lakers' championship playoff run. These NAs can be replaced with 0s. We can identify players who contributed greatly to their team without shooting a 3 by filtering for observations that only have missingness in the 3-point percentage category.
Those without a 2-point attempt or free-throw attempt concern me, as these players also have 0 or near 0 statistics for every other numerical predictor. Table 2 gives a sample of the lack of data for observations in this group.
It does not make sense to include these observations in recipes since their values are not a strong representation of how statistics change salaries. Also considering that these observations make up a small part of the dataset, the best choice is to drop them from the dataset.
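A sketch of this cleaning step, assuming tidyverse tools and hypothetical column names (`x2p_percent`, `ft_percent`, `x3p_percent`) that may differ from the real dataset:

```r
library(dplyr)
library(tidyr)

# Toy rows standing in for the real player-season data.
nba <- tibble(
  player      = c("A", "B", "C"),
  x2p_percent = c(0.52, NA, 0.48),
  x3p_percent = c(0.36, NA, NA),
  ft_percent  = c(0.80, NA, 0.75)
)

nba_clean <- nba |>
  # Drop players with no 2-point or free-throw attempts at all
  # (their other statistics are also 0 or near 0).
  filter(!is.na(x2p_percent), !is.na(ft_percent)) |>
  # A missing 3-point percentage just means zero attempts, so use 0.
  mutate(x3p_percent = replace_na(x3p_percent, 0))
```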
Target Variable Exploration
We start the univariate analysis of our target variable by looking at a histogram of adjusted salaries. Figure 1 shows this distribution. From the histogram, we can see that the majority of NBA players earn less than five million dollars in a year. The distribution does not seem to have any additional local peaks.
However, our distribution is right-skewed. Ideally, we want our outcome variable to have a normal distribution so we can apply statistical properties and techniques that require a normality assumption. Reducing skewness will also make values easier for our model to predict. A common transformation for right-skewed data is a log transformation, which helps reduce the skewness and rein in extreme values. The density plot and boxplot of adjusted salaries in Figure 2 give another perspective on the right-skewness of the data.
After using the log transformation, our data looks a little left-skewed. Nevertheless, we have reduced the skewness, allowing for easier analysis. Figure 3 shows this left-skewness.
We can reduce the skewness of our data even more by considering a more uncommon transformation. I found that transforming the outcome variable by the 7th root essentially removes the skewness. This transformation is visualized in Figure 4.
While this transformation gives us a distribution close to normal, we also need to consider its interpretability: explaining the findings of our model to an NBA player or agent in terms of 7th-root dollars would be difficult. However, interpretability should not be a large concern for us, since we can transform the results of our model back to conventional units by raising predictions to the 7th power. Therefore, with the desire for normality in mind, I decided to transform adjusted salaries by the 7th root.
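A minimal sketch of the transformation and its inverse (the column names are illustrative, not from the project's code):

```r
library(dplyr)

salaries <- tibble(salary_adj = c(5e5, 2e6, 4.5e7))

salaries |>
  mutate(
    salary_7th  = salary_adj^(1/7),  # transform before modeling
    salary_back = salary_7th^7       # invert to recover dollar units
  )
```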
Methods
Again, this prediction model is a regression problem. I use RMSE as the assessment metric for this project. In this regression analysis, I am most concerned with the accuracy of my models' estimates; I want to predict NBA salaries as closely as possible, as this is the main value a player or agent cares about. RMSE also penalizes large errors heavily, so models that badly miss on extreme salaries are flagged by the metric.
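For reference, RMSE over $n$ observations is

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
$$

so each error enters quadratically and large misses dominate the metric.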
Data Splitting
Before splitting, our data set has 13985 observations. This data set is on the smaller side, so the split should lean towards having many training observations. Of course, we do not want our model to overfit the training data. I chose a 0.75/0.25 training-testing split, as I believed this to be a reasonable balance for the data. Our training set has 10440 observations, and our testing set has 3483 observations.
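With the rsample package, this split could look like the following (the object names and seed are my own; the report does not specify them):

```r
library(rsample)
library(tibble)

# Placeholder data standing in for the ~14,000-row cleaned set.
nba_clean <- tibble(salary_7th = runif(1000, 8, 13), pts = rnorm(1000))

set.seed(301)  # arbitrary seed for reproducibility
nba_split <- initial_split(nba_clean, prop = 0.75)
nba_train <- training(nba_split)
nba_test  <- testing(nba_split)
```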
Resampling Technique
I chose v-fold cross-validation as my resampling method, with 10 folds repeated 5 times. I chose 10 folds to strike a balance between bias and variance. Again, given that our data set is on the smaller side, it is not necessary to explore a larger number of folds, as a higher value would likely produce estimates with very high variance. Repeating 5 times helps average out the noisiness of our data and improves the accuracy of the standard errors of our performance metric.
When folding, our model fits on 9396 observations and is assessed on about 1044 observations. I find these numbers to be reasonable; we have a nice number of observations for the model itself and ample observations for assessment. Through data collection, I noticed that each NBA season has about 500 players. So, we can think of this method as assessing with two seasons’ worth of players.
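The corresponding rsample call (object names assumed, as a sketch) produces 50 resamples, each fitting on 9/10 of the training data:

```r
library(rsample)
library(tibble)

# Placeholder training data standing in for the real training set.
nba_train <- tibble(salary_7th = runif(1000, 8, 13), pts = rnorm(1000))

set.seed(301)
nba_folds <- vfold_cv(nba_train, v = 10, repeats = 5)  # 10 folds x 5 repeats
```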
Model Types
My project will use the null model, linear regression model, elastic net model, random forest model, gradient boosted tree model, neural network regression model, and multivariate adaptive regression splines model. A key reason for my choices was to progressively increase flexibility while examining how the interpretability of my results changes. Also, many of the predictor variables are highly correlated with each other. A few examples of these correlations are seen in Table 3:
| term | mp | fg | fga |
|---|---|---|---|
| mp | NA | 0.8913389 | 0.8951823 |
| fg | 0.8913389 | NA | 0.9810554 |
| fga | 0.8951823 | 0.9810554 | NA |
High correlation can lead to unstable estimates and overfitting in my models, so I address this issue through my choices of methodology. An expanded analysis of this observation can be found in the EDA portion of the appendix.
Null and Linear
The null and linear regression models are used mainly as a baseline. These models give a simple, trivial result to my data. We can assume that there are complexities in our data that need to be accounted for to produce the best performance metric. However, having baselines gives us a good indication of whether or not our other models are being hurt by complexity.
Keeping in mind the simplicity of the models, the null and linear regression models do not have hyperparameters to tune.
Elastic Net
The elastic net model will address some of the high correlation and multicollinearity concerns I have with the data: because the elastic net combines the lasso and ridge penalties, which shrink and stabilize coefficient estimates when predictors are highly correlated, this model is well suited to my analysis.
I will tune two hyperparameters for this model. The mixture hyperparameter allows us to test different proportional combinations of the lasso and ridge techniques used for penalization. The penalty hyperparameter allows us to vary the punishment a model incurs for multicollinearity issues.
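In tidymodels, an elastic net specification with both hyperparameters flagged for tuning might look like this (a sketch assuming the glmnet engine; the report's actual code may differ):

```r
library(parsnip)
library(tune)

en_spec <- linear_reg(
  penalty = tune(),  # strength of the regularization penalty
  mixture = tune()   # proportion of lasso vs. ridge (0 = ridge, 1 = lasso)
) |>
  set_engine("glmnet") |>
  set_mode("regression")
```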
Random Forest
The random forest model is a nice median between interpretability and flexibility. The model handles overfitting concerns and outliers well. In addition, this model can handle both linear and nonlinear relationships. Exploring the training set, I saw a few predictors that may have a nonlinear relationship with the outcome variable. For example, Figure 5 shows that the blocks predictor does not seem to have a steady, linear trend.
More examples are found in the EDA. If these relationships are truly nonlinear, random forests will handle them with little tuning. Concerning tuning, I vary three hyperparameters. First, I vary the number of predictors considered at each split of a decision tree. It is important to keep in mind that higher values of this hyperparameter make the trees more similar to one another, which can lead to overfitting. Next, I tune the number of decision trees used in the prediction model. Allowing for a larger number of trees should lead to more robust predictions at the cost of longer computation times. Finally, I vary the minimum number of observations a node must contain before it can be split. It is best to keep this value on the lower side, as lower values allow the trees to grow deep enough to capture the patterns of our data.
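A parsnip sketch of this specification, assuming the ranger engine; `mtry`, `trees`, and `min_n` map to the three hyperparameters described above:

```r
library(parsnip)
library(tune)

rf_spec <- rand_forest(
  mtry  = tune(),  # predictors considered at each split
  trees = tune(),  # number of trees in the forest
  min_n = tune()   # minimum observations in a node to allow a split
) |>
  set_engine("ranger") |>
  set_mode("regression")
```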
Gradient Boosted Tree
Boosted trees allow for more complexity than random forests, potentially producing better performance metrics. However, the hyperparameters are quite sensitive: if they are not tuned carefully, the results will not be worthwhile. This model tunes the same hyperparameters as the random forest model. Additionally, the boosted tree tunes the rate at which the model learns from previous iterations of itself. Using a lower value for this hyperparameter decreases the chance of overfitting since the weight of each tree is smaller. However, a lower value means a higher number of trees is needed to achieve robust results.
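A corresponding parsnip sketch, assuming the xgboost engine; `learn_rate` is the learning-rate hyperparameter described above:

```r
library(parsnip)
library(tune)

bt_spec <- boost_tree(
  mtry       = tune(),
  trees      = tune(),
  min_n      = tune(),
  learn_rate = tune()  # how strongly each new tree corrects the last
) |>
  set_engine("xgboost") |>
  set_mode("regression")
```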
MARS
I am choosing to try the multivariate adaptive regression splines (MARS) model mainly for its ability to deal with nonlinearity while still having high interpretability compared to other models with a heavy nonlinearity focus. However, if our nonlinear data has many sharp changes (say one of our predictor-outcome relationships exhibits multiple peaks and troughs), the piecewise linear segments that MARS uses to capture its nonlinear relationships may not be accurate.
For this model, I am tuning two hyperparameters. The first parameter varies the number of terms that are used in the final prediction model. Increasing this value can allow us to capture more complexity. However, increasing this number too much will result in overfitting. The second hyperparameter varies the degree of the interaction term in the model. Similar to the first parameter, increasing the allowed degrees helps with capturing complexities in the model at the cost of potentially overfitting.
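The two MARS hyperparameters correspond to `num_terms` and `prod_degree` in parsnip (a sketch assuming the earth engine):

```r
library(parsnip)
library(tune)

mars_spec <- mars(
  num_terms   = tune(),  # terms retained in the final model
  prod_degree = tune()   # maximum interaction degree
) |>
  set_engine("earth") |>
  set_mode("regression")
```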
Neural Network
The neural network regression model will provide more nonlinear flexibility than MARS if the intricacy of our data becomes a problem. While the model can create estimations with high accuracy, this process is very time-consuming, with multiple hyperparameters needing to be tuned. Since our data set is on the smaller side, it may be the case that neural networks actually overfit our data.
Three hyperparameters are tuned for this model. First, we vary the weight decay of the model, with larger penalties preventing the model from learning overly complex patterns that are specific to the training set. Next, we adjust the number of neurons used in the hidden layer of the model, with more neurons leading to more complexity and more potential for overfitting. Finally, we modify the number of passes over the entire training set (epochs) that the model performs. As with the number of neurons, larger values mean more capacity to fit the training data but more potential for overfitting.
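These three hyperparameters map onto parsnip's `mlp()` specification (a sketch assuming the nnet engine):

```r
library(parsnip)
library(tune)

nn_spec <- mlp(
  penalty      = tune(),  # weight decay
  hidden_units = tune(),  # neurons in the hidden layer
  epochs       = tune()   # passes over the training set
) |>
  set_engine("nnet") |>
  set_mode("regression")
```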
Recipes
For this project, I use four separate recipes to determine the best-performing model. I want to examine how well various predictors and techniques do at estimating salaries. Here, I give explanations for my recipe decisions.
For all recipes, predictors with zero variance are removed. A predictor with zero variance takes the same value for every observation and therefore contributes nothing to the recipe. To place all predictors on the same scale, we normalize the predictors. We also drop predictors that have high correlations with other predictors. Again, high correlations can lead to multicollinearity, resulting in overfitting of the model. In general, correlations over 0.7 are considered high, so we remove predictors with correlations above this threshold.
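These shared steps translate into recipes code along these lines (the outcome name, data object, and toy columns are assumptions for the sketch):

```r
library(recipes)
library(tibble)

# Placeholder training data.
nba_train <- tibble(
  salary_7th = runif(50, 8, 13),
  pts = rnorm(50), ast = rnorm(50),
  constant = 1  # example zero-variance column
)

base_rec <- recipe(salary_7th ~ ., data = nba_train) |>
  step_zv(all_predictors()) |>                          # drop zero-variance predictors
  step_normalize(all_numeric_predictors()) |>           # center and scale
  step_corr(all_numeric_predictors(), threshold = 0.7)  # drop highly correlated predictors
```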
General
The first recipe focuses on general player statistics, such as points, assists, and rebounds. These are the predictors the average person thinks of when they imagine basketball statistics. This recipe is simple and minimizes the complexity of the modeling process: only a handful of interaction terms are used, and no nonlinear trends are addressed. I imagine we will get fairly reasonable predictions with this recipe, but more complexity will likely improve the results. After removing high correlations and zero-variance predictors, we end up with 21 predictors in this recipe. Table 4 shows the predictors used in this recipe:
Interaction
The second recipe places a heavy focus on interaction terms. There are several predictor variables where we may expect differential effects to occur. For example, Figure 6 visualizes a positive correlation between games started and salaries. Moreover, players who have been in the NBA for 10 or more years have higher salaries at every level along the x-axis. So, we should create a predictor that is the interaction between games started and whether a player has been in the NBA for 10 or more years.
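Such an interaction can be added with `step_interact()`; here `gs` (games started) and `vet_10plus` (1 if the player has 10 or more seasons) are hypothetical column names for the sketch:

```r
library(recipes)
library(tibble)

# Placeholder data with the two hypothetical columns.
nba_train <- tibble(
  salary_7th = runif(50, 8, 13),
  gs         = rpois(50, 40),
  vet_10plus = rbinom(50, 1, 0.3)
)

interact_rec <- recipe(salary_7th ~ ., data = nba_train) |>
  step_interact(terms = ~ gs:vet_10plus)  # games started x 10+ year veteran
```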
Another key difference between this recipe and the general recipe is the use of percentage variables. For example, instead of using the variable for field goals, I use the variable for field-goal percentage. Here, I am trying to see whether changing the form of the variable changes the outcome of the model. I do not expect this change to have significant effects, as these variables can be derived from combinations of other variables. All in all, the interactions recipe has 33 predictors, as seen in Table 5.